
Lecture 09: Charts¶

Associated Textbook Sections: 7.0, 7.1


Overview¶

  • W. E. B. Du Bois
  • Why Do We Visualize Data
  • Course Visualizations
  • Categorical Data
  • Numerical Data

Set Up the Notebook¶

In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

W. E. B. Du Bois¶


Background¶

W.E.B. Du Bois

The content of the following quotes, podcast/video, and images contains references to the enslavement and murder of humans.

  • Scholar, historian, activist, and data scientist

    "The Philadelphia Negro was the first scientific study of race in the world. [...] the first non-racist investigation of a non-white population in the world. [...] one of the first social scientific written in the U.S. using the advanced statistical methods of the time." - Dr. Tukufu Zuberi, Professor of Race Relations at the University of Pennsylvania (Source: A Legacy of Courage: W.E.B. Du Bois and the Philadelphia Negro)

  • First Black American to receive a PhD from Harvard
  • NAACP founder

Paris Exposition¶

Du Bois made a series of visualizations for the 1900 Paris Exposition.

  • Goal: Change the way people see Black Americans
  • Hundreds of photographs and patents
  • 60+ handmade graphs in 3 months

"All art is propaganda, and ever must be, despite the wailing of the purists. I stand in utter shamelessness and say that whatever art I have for writing has been used always for propaganda for gaining the right of black folk to love and enjoy. I do not care a damn for any art that is not used for propaganda." - W.E.B. Du Bois

  • Ideologies of W.E.B. Du Bois and Booker T. Washington are typically compared
  • The following podcast provides an 11-minute overview of Du Bois and Washington:
In [2]:
from IPython.display import IFrame
IFrame('https://www.youtube-nocookie.com/embed/zHn-vSTMOWE?si=Qk49xkLYJyAlBY5k',
       width="560", height="315", frameBorder="0", allowfullscreen="", 
       allow="accelerometer; autoplay; clipboard-write; \
       encrypted-media; fullscreen; gyroscope; picture-in-picture; web-share", 
       loading="lazy", title="YouTube video player",
       referrerpolicy="strict-origin-when-cross-origin")
Out[2]:

Images from Paris Exposition¶

The following images are from:

  • Smithsonian Magazine - W.E.B. Du Bois’ Visionary Infographics Come Together for the First Time in Full Color
  • WBUR - W.E.B. Du Bois Created These Infographics In 1900 To Humanize The African-American Experience
Du Bois's City and Rural Populations Graph
Du Bois's Proportion of Freeman and Slaves Graphic
Du Bois's graphic of Income and Expenditures

Why Do We Visualize Data¶

  • A large fraction of our brains is dedicated to visual reasoning.
  • In Data Science we use visualization:
    • For others – to communicate our findings
    • For ourselves – to understand our data, see patterns, and discover relationships

Course Visualizations¶

  • In the course we will mostly use the following visualizations:
    • Histograms
    • Line Graphs
    • Scatter Plots
    • Bar Charts
  • You will need to overlay graphs to explore relationships
  • How you visualize your data depends on attribute type
  • The data type alone doesn't determine whether an attribute is numerical or categorical.
    • '$12.00' is a str but likely reflects a numerical attribute
    • The context of the data and analysis is important to understand
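As a minimal sketch of the '$12.00' case, the following plain-Python example converts currency strings to floats so they can be treated as numerical values. The helper name parse_dollars and the sample prices are made up for illustration; they are not part of the datascience library or the course data.

```python
def parse_dollars(text):
    """Convert a currency string such as '$12.00' to a float."""
    # Strip the dollar sign and any thousands separators before converting.
    return float(text.replace('$', '').replace(',', ''))

prices = ['$12.00', '$7.50', '$1,250.00']   # made-up values for illustration
amounts = [parse_dollars(p) for p in prices]

print(type(prices[0]).__name__)      # str
print(amounts)                       # [12.0, 7.5, 1250.0]
print(sum(amounts) / len(amounts))   # arithmetic only makes sense after converting
```

Whether a conversion like this is appropriate still depends on the context of the data and the analysis.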

The datascience library builds its charts on the standard Matplotlib library, so you will work with Matplotlib indirectly. You can optionally explore interactive visualizations using the Plotly library, but creating and customizing interactive visualizations is not required, and you will not be tested on them.


Good Practices¶

  • Less can be more
    • Minimize decoration
    • Choose colors carefully: Minimize the number of different colors
  • If data are numerical, preserve their relative values and distances between them

See Edward Tufte's "The Visual Display of Quantitative Information" for additional suggestions.


Demo: Identifying Data Type of Column Values¶

The dataset top_grossing_movies.csv lists the 100 highest-grossing movies worldwide according to IMDb. Adjusted total gross values are also provided; they were computed with the Consumer Price Index (CPI)-based Python library cpi to account for inflation.

In [3]:
top_movies = Table.read_table('top_grossing_movies.csv')
top_movies
Out[3]:
Created | Modified | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Gross | Gross (Adjusted)
2/3/23 | 2/9/23 | Avatar | https://www.imdb.com/title/tt0499549/ | Movie | 7.9 | 162 | 2009 | Action, Adventure, Fantasy, Sci-Fi | 1402235 | 12/18/09 | James Cameron | 2923706026 | 4.15247e+09
2/3/23 | 8/9/23 | Avengers: Endgame | https://www.imdb.com/title/tt4154796/ | Movie | 8.4 | 181 | 2019 | Action, Adventure, Drama, Sci-Fi | 1298188 | 4/26/19 | Anthony Russo, Joe Russo | 2799439100 | 3.33648e+09
2/3/23 | 5/23/23 | Avatar: The Way of Water | https://www.imdb.com/title/tt1630029/ | Movie | 7.5 | 192 | 2022 | Action, Adventure, Fantasy, Sci-Fi | 509990 | 12/16/22 | James Cameron | 2320246732 | 2.41576e+09
2/3/23 | 3/13/23 | Titanic | https://www.imdb.com/title/tt0120338/ | Movie | 7.9 | 194 | 1997 | Drama, Romance | 1304263 | 12/19/97 | James Cameron | 2255734210 | 4.28241e+09
2/3/23 | 2/3/23 | Star Wars: Episode VII - The Force Awakens | https://www.imdb.com/title/tt2488496/ | Movie | 7.8 | 138 | 2015 | Action, Adventure, Sci-Fi | 983853 | 12/18/15 | J.J. Abrams | 2071310218 | 2.66281e+09
2/3/23 | 2/3/23 | Avengers: Infinity War | https://www.imdb.com/title/tt4154756/ | Movie | 8.4 | 149 | 2018 | Action, Adventure, Sci-Fi | 1232719 | 4/27/18 | Anthony Russo, Joe Russo | 2052415039 | 2.49047e+09
2/3/23 | 6/7/23 | Spider-Man: No Way Home | https://www.imdb.com/title/tt10872600/ | Movie | 8.2 | 148 | 2021 | Action, Adventure, Fantasy, Sci-Fi | 905126 | 12/17/21 | Jon Watts | 1921847111 | 2.16109e+09
2/3/23 | 6/7/23 | Jurassic World | https://www.imdb.com/title/tt0369610/ | Movie | 6.9 | 124 | 2015 | Action, Adventure, Sci-Fi | 688018 | 6/12/15 | Colin Trevorrow | 1671537444 | 2.14888e+09
2/3/23 | 2/3/23 | The Lion King | https://www.imdb.com/title/tt6105098/ | Movie | 6.8 | 118 | 2019 | Animation, Adventure, Drama, Family, Fantasy, Musical | 271300 | 7/19/19 | Jon Favreau | 1663075401 | 1.98212e+09
2/3/23 | 2/3/23 | The Avengers | https://www.imdb.com/title/tt0848228/ | Movie | 8 | 143 | 2012 | Action, Sci-Fi | 1477106 | 5/4/12 | Joss Whedon | 1518815515 | 2.01567e+09

... (90 rows omitted)

The movie titles reflect a categorical attribute.

In [4]:
type(top_movies.column('Title').item(0))
Out[4]:
str

The movie years reflect a numerical attribute.

In [5]:
type(top_movies.column('Year').item(0))
Out[5]:
int

Be careful. Sometimes the data type doesn't align with the intended attribute type.

Categorical Data¶


Horizontal bar charts (tbl.barh) are a standard way to visualize the distribution of a single categorical variable.


A Bar Chart¶

The following code uses group, which we will address later in the course. Additionally, the lines that start with plt customize the visual. You are not responsible for this customization.

In [6]:
cones = Table.read_table('cones.csv')
cones_grouped_by_flavor = cones.group('Flavor')
cones_grouped_by_flavor.barh('Flavor')

plt.title('Distribution of Ice Cream Flavors')
plt.show()

Demo: Bar Charts¶

Reduce the table to the top 10 movies by adjusted gross ('Gross (Adjusted)') among movies released after 2014, roughly the last decade.

In [7]:
top_movies_select = top_movies.select('Title', 'Year', 'Gross (Adjusted)')
# Keep movies released after 2014
top_movies_last_decade = top_movies_select.where('Year', are.above(2014))
# Sort from highest to lowest adjusted gross
top_movies_last_decade_sorted = top_movies_last_decade.sort('Gross (Adjusted)', descending=True)
# Take the first 10 rows
top10 = top_movies_last_decade_sorted.take(np.arange(10))
top10
Out[7]:
Title | Year | Gross (Adjusted)
Avengers: Endgame | 2019 | 3.33648e+09
Star Wars: Episode VII - The Force Awakens | 2015 | 2.66281e+09
Avengers: Infinity War | 2018 | 2.49047e+09
Avatar: The Way of Water | 2022 | 2.41576e+09
Spider-Man: No Way Home | 2021 | 2.16109e+09
Jurassic World | 2015 | 2.14888e+09
The Lion King | 2019 | 1.98212e+09
Furious 7 | 2015 | 1.94808e+09
Avengers: Age of Ultron | 2015 | 1.80625e+09
Frozen II | 2019 | 1.73256e+09

Convert the adjusted gross values in the top10 table to billions of dollars for readability.

In [8]:
billions = np.round(top10.column('Gross (Adjusted)') / 1000000000, 2)
top10 = top10.with_column('Gross Adjusted (Billions)', billions)
top10
Out[8]:
Title | Year | Gross (Adjusted) | Gross Adjusted (Billions)
Avengers: Endgame | 2019 | 3.33648e+09 | 3.34
Star Wars: Episode VII - The Force Awakens | 2015 | 2.66281e+09 | 2.66
Avengers: Infinity War | 2018 | 2.49047e+09 | 2.49
Avatar: The Way of Water | 2022 | 2.41576e+09 | 2.42
Spider-Man: No Way Home | 2021 | 2.16109e+09 | 2.16
Jurassic World | 2015 | 2.14888e+09 | 2.15
The Lion King | 2019 | 1.98212e+09 | 1.98
Furious 7 | 2015 | 1.94808e+09 | 1.95
Avengers: Age of Ultron | 2015 | 1.80625e+09 | 1.81
Frozen II | 2019 | 1.73256e+09 | 1.73

Visualize the adjusted gross values for each of the top 10 movies.

In [9]:
top10.barh('Title', 'Gross Adjusted (Billions)')

plt.title("The Top 10 Grossing Movies")
plt.show()

Visual Perception Accuracy¶


According to Nathan Yau’s Data Points: Visualization That Means Something, how accurately our eyes extract information depends on the design.

Visualizations ordered by levels of accuracy

For this reason, pie charts are generally discouraged: most people judge angles less accurately than they judge the lengths of bars.


Demo: Visualizing Du Bois¶

Read the du_bois.csv data as a table, reformat the data, and create a stacked bar chart.

In [10]:
du_bois = Table.read_table('du_bois.csv')
du_bois.set_format('RENT', PercentFormatter)
du_bois.set_format('FOOD', PercentFormatter)
du_bois.set_format('CLOTHES', PercentFormatter)
du_bois.set_format('TAXES', PercentFormatter)
du_bois.set_format('OTHER', PercentFormatter)
du_bois
Out[10]:
CLASS | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS
100-200 | 139.1 | 19.00% | 43.00% | 28.00% | 0.10% | 9.90% | POOR
200-300 | 249.45 | 22.00% | 47.00% | 23.00% | 4.00% | 4.00% | POOR
300-400 | 335.66 | 23.00% | 43.00% | 18.00% | 4.50% | 11.50% | FAIR
400-500 | 433.82 | 18.00% | 37.00% | 15.00% | 5.50% | 24.50% | FAIR
500-750 | 547 | 13.00% | 31.00% | 17.00% | 5.00% | 34.00% | COMFORTABLE
750-1000 | 880 | 0.00% | 37.00% | 19.00% | 8.00% | 36.00% | COMFORTABLE
1000 and over | 1125 | 0.00% | 29.00% | 16.00% | 4.50% | 50.50% | WELL-TO-DO

Notice that the table is formatted to show percentages, but the values in the % columns are actually floats.

In [ ]:
du_bois.column('RENT')
In [ ]:
type(du_bois.column('RENT').item(0))
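The difference between a stored value and its displayed form can be sketched in plain Python. This example assumes the rent share is stored as a fraction (0.19 for 19%) and uses Python's built-in percent format to mimic the display; it is an illustration, not the datascience library's PercentFormatter code.

```python
rent_share = 0.19                         # assumed stored value: a plain float
displayed = '{:.2%}'.format(rent_share)   # display string, similar to the table's cells

print(displayed)            # 19.00%
print(type(rent_share))     # <class 'float'>
print(rent_share * 139.1)   # arithmetic still works on the stored float
```

Because only the display changes, sorting, comparing, and plotting still operate on the underlying floats.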

For a quick review, find the income bracket (CLASS) that spent the highest percentage of their income on rent.

In [ ]:
...

Start to re-create the bar chart that Du Bois presented in Paris.

In [11]:
du_bois_for_bar = du_bois.drop('ACTUAL AVERAGE', 'STATUS')
du_bois_for_bar
Out[11]:
CLASS | RENT | FOOD | CLOTHES | TAXES | OTHER
100-200 | 19.00% | 43.00% | 28.00% | 0.10% | 9.90%
200-300 | 22.00% | 47.00% | 23.00% | 4.00% | 4.00%
300-400 | 23.00% | 43.00% | 18.00% | 4.50% | 11.50%
400-500 | 18.00% | 37.00% | 15.00% | 5.50% | 24.50%
500-750 | 13.00% | 31.00% | 17.00% | 5.00% | 34.00%
750-1000 | 0.00% | 37.00% | 19.00% | 8.00% | 36.00%
1000 and over | 0.00% | 29.00% | 16.00% | 4.50% | 50.50%
In [12]:
du_bois_for_bar.barh('CLASS')
                     
# Some extra graph formatting you are not responsible for
plt.title('W.E.B. Du Bois Income and Expenditure')
plt.show()

[Optional] Interactive Charts with Plotly¶

  • By default, we will be using the static visualizations that are made using the Matplotlib library.
  • You can access interactive Plotly visualizations by adding an i in front of the table method name that creates the default visual (for example, barh becomes ibarh).
  • The arguments change to fit the Plotly functions.

[Optional] Demo: Visualizing Du Bois with Plotly¶

Create the interactive version of the bar chart.

In [13]:
du_bois_for_bar.ibarh(
    column_for_categories='CLASS',
    title='W.E.B. Du Bois Income and Expenditure',
    xaxis=dict(tickformat='0.1%')
)

Plotly has an easy way to stack the bars to create a stacked bar chart.

In [14]:
# barmode and xaxis are available with ibarh because they are Plotly arguments
fig = du_bois_for_bar.ibarh(
    column_for_categories='CLASS',
    barmode="stack",
    title='W.E.B. Du Bois Income and Expenditure',
    xaxis=dict(tickformat='0.1%')
)

We are starting to get something that looks like Du Bois's visual, but let's stop there because this is optional for this class. If you like creating visualizations, try to read through the Plotly documentation or Matplotlib documentation to update the colors, add overlaid text, etc.
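Conceptually, stacking places each segment of a bar where the previous segment ended. The sketch below works this out in plain Python for the 100-200 income class from the du_bois table, assuming the shares are stored as fractions. The helper stacked_segments is hypothetical and only illustrates the idea; it is not how Plotly or the datascience library is implemented.

```python
# Shares for the 100-200 income class from the du_bois table
# (assumed to be stored as fractions, e.g. 0.19 for 19%).
categories = ['RENT', 'FOOD', 'CLOTHES', 'TAXES', 'OTHER']
shares = [0.19, 0.43, 0.28, 0.001, 0.099]

def stacked_segments(labels, widths):
    """Return (label, start, end) for each segment of one stacked bar."""
    segments = []
    start = 0.0
    for label, width in zip(labels, widths):
        segments.append((label, start, start + width))
        start += width   # the next segment begins where this one ends
    return segments

for label, start, end in stacked_segments(categories, shares):
    print(f'{label:8s} {start:6.1%} to {end:6.1%}')
```

In Matplotlib, start positions like these could be passed to barh through its left parameter to draw the same kind of chart by hand.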


Numerical Data¶


Visualizing the Distribution of One Numerical Variable¶

Histograms (tbl.hist) are a standard way to visualize the distribution of one numerical variable.

The next lecture focuses on histograms.


A Histogram¶

In [15]:
top_movies.hist('Year', unit="Year") 

# Some extra graph formatting you are not responsible for
plt.title('Distribution of Release Years')
plt.show()

Plotting Two Numerical Variables¶

Line graphs (tbl.plot) and scatter plots (tbl.scatter) are standard ways to visualize the relationship between two numerical variables.


A Line Graph¶

In [16]:
movies_per_year = top_movies.group('Year').relabeled('count', 'Number of Movies')
movies_per_year.where('Year', are.above(1999)).plot('Year', 'Number of Movies') 

# Some extra graph formatting you are not responsible for
plt.xticks(np.arange(2000, 2023, 5))
plt.title('Number of Movies vs. Release Year')
plt.show()

A Scatter Plot¶

In [17]:
actors = Table.read_table('actors.csv')
actors.scatter('Number of Movies', 'Average per Movie')

# Some extra graph formatting you are not responsible for
plt.title('Average Pay per Movie (Thousands of Dollars) vs. Number of Movies')
plt.show()

When to use a line vs scatter plot?¶

  • Use line plots for sequential data if:
    • ... your x-axis has an order
    • ... sequential differences in y values are meaningful
    • ... there's only one y-value for each x-value
  • Usually: x-axis is time or distance
  • Use scatter plots for non-sequential data, when you’re looking for associations
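Two of the criteria above can be checked mechanically, as in this rough sketch. The helpers one_y_per_x and suggest_plot are hypothetical names, not part of the datascience library, and the sample values are illustrative. Whether sequential differences in y are meaningful still requires judgment, so the result is only a starting point.

```python
def one_y_per_x(x_values):
    """Return True when no x-value repeats, so each x has a single y."""
    return len(set(x_values)) == len(x_values)

def suggest_plot(x_values, x_is_ordered):
    """Rough rule of thumb (hypothetical helper, not part of datascience)."""
    if x_is_ordered and one_y_per_x(x_values):
        return 'line'
    return 'scatter'

years = [2000, 2001, 2002, 2003]   # e.g. one movie count per year
ratings = [7.9, 8.4, 7.5, 7.9]     # the rating 7.9 appears for more than one movie

print(suggest_plot(years, x_is_ordered=True))     # line
print(suggest_plot(ratings, x_is_ordered=True))   # scatter
```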

Demo: Scatter Plots¶

Visualize the relationship between the IMDb rating and the number of votes for the movies in top_movies.

In [18]:
top_movies.show(3)
Created | Modified | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Gross | Gross (Adjusted)
2/3/23 | 2/9/23 | Avatar | https://www.imdb.com/title/tt0499549/ | Movie | 7.9 | 162 | 2009 | Action, Adventure, Fantasy, Sci-Fi | 1402235 | 12/18/09 | James Cameron | 2923706026 | 4.15247e+09
2/3/23 | 8/9/23 | Avengers: Endgame | https://www.imdb.com/title/tt4154796/ | Movie | 8.4 | 181 | 2019 | Action, Adventure, Drama, Sci-Fi | 1298188 | 4/26/19 | Anthony Russo, Joe Russo | 2799439100 | 3.33648e+09
2/3/23 | 5/23/23 | Avatar: The Way of Water | https://www.imdb.com/title/tt1630029/ | Movie | 7.5 | 192 | 2022 | Action, Adventure, Fantasy, Sci-Fi | 509990 | 12/16/22 | James Cameron | 2320246732 | 2.41576e+09

... (97 rows omitted)

In [19]:
top_movies.scatter('IMDb Rating', 'Num Votes')

# Some extra graph formatting you are not responsible for
plt.title('Number of Votes vs. IMDb Rating')
plt.show()

[Optional] Demo: Interactive Scatter Plots¶

Again, for every visualization method we use from the datascience library, putting an i in front of the method name gives you an interactive version of the plot, built on another visualization library called Plotly. You will not be tested on these interactive plots, but you might find them helpful for exploring the data.

In [20]:
top_movies.iscatter(
    column_for_x='IMDb Rating', 
    select='Num Votes', 
    labels='Title', 
    title='Number of Votes vs. IMDb Rating'
)

Attribution¶

This content is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) and is derived from Data 8: The Foundations of Data Science, a course offered by the University of California, Berkeley.
